Recovery in Massively Parallel Systems
Abstract
The objective of ESPRIT project 6731 FTMPS [1] is to develop techniques and system software that integrate Fault Tolerance in Massively Parallel Systems [2]. The work covers the whole range from error detection, through fault diagnosis and fault isolation, to system and application recovery. The research emphasizes applicability to massively parallel systems and the development of system software that may be commercialized in future products. The project partners are: Parsytec Computer GmbH (D), British Aerospace Ltd. (UK), Katholieke Universiteit Leuven (B), Universität-GH Paderborn (D) (recently replaced by the Medizinische Universität zu Lübeck), Universität Erlangen-Nürnberg (D) and Universidade de Coimbra (P). Although Parsytec systems (the PowerXplorer is one of them) served as the development hardware, the methodologies and implementations have been kept as hardware-independent as possible.
Similar resources
Facing up to the Inevitable: Intelligent Error Recovery in Massively Parallel Processing in Memory Architectures
Massively parallel “Processing-In-Memory” (PIM) architectures have been shown to yield increases in performance due to their “memory-centric” nature. However, as PIM is still a developing technology, advanced issues such as error detection and failure recovery have not yet been addressed. We describe the application of concepts found in our multi-agent system, ADE, to PIM, incorporating its mec...
A User-triggered Checkpointing Library for Computation-intensive Applications
We propose a method to incorporate coordinated checkpointing and rollback in high performance computing applications on massively parallel computers. A library allows the user to specify which data-items (including files) belong to the contents of the checkpoint, and to trigger the checkpointing in the application. The recovery-line management on the distributed disk system takes care of which ...
Massively Parallel Execution Model and Massively Parallel Architecture
The purposes of the research and development of the RWC massively parallel computer project are (1) to efficiently support flexible and integrated computation, which are research targets in the RWC Project, (2) to pursue a general-purpose massively parallel system efficiently supporting multiple programming paradigms, and (3) to realize a stand-alone system which has a mature operating system. For th...
A Software Implemented Fault-tolerance Layer for Reliable Computing on Massively Parallel Computers and Distributed Computing Systems
A novel architecture for a software-implemented fault-tolerance layer for application reliability on massively parallel computers and distributed computing systems is proposed. This is the first attempt at providing a purely software-based, user-level solution for fault detection, reconfiguration, and recovery in a parallel environment. The symmetrically distributed, multi-tiered layer envelopes u...